This project focuses on analyzing the historical gas price trend and predict the future gas price. As we know, the prices at the pump started rising once the lockdowns were lifted and have spiraled faster since the start of the war. Our idea is to predict weekly gas prices using indicators like crude oil price, stocks, import/export etc.
The data source details are mentioned as follows. Historical gas price - https://www.eia.gov/opendata/qb.php?sdid=PET.EMM_EPM0U_PTE_NUS_DPG.W Crude oil spot price - https://www.eia.gov/opendata/qb.php?sdid=PET.RWTC.D Stocks of gasoline - https://www.eia.gov/opendata/qb.php?sdid=PET.WGFSTUS1.W Supply of gasoline - https://www.eia.gov/opendata/qb.php?sdid=PET.W_EPM0_VSD_NUS_DAYS.W Refinery capacity - https://www.eia.gov/opendata/qb.php?sdid=PET.WPULEUS3.W Crude oil exports - https://www.eia.gov/opendata/qb.php?sdid=PET.WCREXUS2.W Crude oil imports - https://www.eia.gov/opendata/qb.php?sdid=PET.WCRIMUS2.W
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(TSstudio)
## Warning: package 'TSstudio' was built under R version 4.1.3
library(corrplot)
## corrplot 0.92 loaded
library(skimr)
library(tidyr)
library(stringr)
library(caret)
## Loading required package: lattice
library(MLmetrics)
##
## Attaching package: 'MLmetrics'
## The following objects are masked from 'package:caret':
##
## MAE, RMSE
## The following object is masked from 'package:base':
##
## Recall
library(forcats)
library(car)
## Warning: package 'car' was built under R version 4.1.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.1.3
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
gas_price <- read.csv("./data/Gas_Price_Data_Merged.csv")
Check the structure of the gas price merged data
print(str(gas_price))
## 'data.frame': 1006 obs. of 9 variables:
## $ Date : chr "03/25/2022" "03/18/2022" "03/11/2022" "02/25/2022" ...
## $ Crude_price_per_barrel : num 113.7 100.4 113.4 92.2 92.9 ...
## $ Stocks_per_thousand_barrels : int 18971 18344 17123 17714 17624 17746 18979 19797 19732 16986 ...
## $ No_of_days : num 27.3 27 27.3 28 28.5 28.7 30.4 30.2 29 25.6 ...
## $ Utilization_percentage : num 92.1 91.1 90.4 87.7 87.4 85.3 86.7 87.7 88.1 89.8 ...
## $ Exp_thousand_barrels_per_day: int 2988 3844 2936 3796 2686 2271 2376 2796 2610 2554 ...
## $ Imp_thousand_barrels_per_day: int 6259 6486 6395 5767 6828 5790 7085 6236 6745 5884 ...
## $ Gas_price_date : chr "03/28/2022" "03/21/2022" "03/14/2022" "02/28/2022" ...
## $ Gas_price_per_gallon : num 4.15 4.16 4.25 3.55 3.48 ...
## NULL
gas_price$Date <- as.Date(gas_price$Date, format="%m/%d/%Y")
gas_price$Gas_price_date<-as.Date(gas_price$Gas_price_date, format="%m/%d/%Y")
print(str(gas_price))
## 'data.frame': 1006 obs. of 9 variables:
## $ Date : Date, format: "2022-03-25" "2022-03-18" ...
## $ Crude_price_per_barrel : num 113.7 100.4 113.4 92.2 92.9 ...
## $ Stocks_per_thousand_barrels : int 18971 18344 17123 17714 17624 17746 18979 19797 19732 16986 ...
## $ No_of_days : num 27.3 27 27.3 28 28.5 28.7 30.4 30.2 29 25.6 ...
## $ Utilization_percentage : num 92.1 91.1 90.4 87.7 87.4 85.3 86.7 87.7 88.1 89.8 ...
## $ Exp_thousand_barrels_per_day: int 2988 3844 2936 3796 2686 2271 2376 2796 2610 2554 ...
## $ Imp_thousand_barrels_per_day: int 6259 6486 6395 5767 6828 5790 7085 6236 6745 5884 ...
## $ Gas_price_date : Date, format: "2022-03-28" "2022-03-21" ...
## $ Gas_price_per_gallon : num 4.15 4.16 4.25 3.55 3.48 ...
## NULL
summary(gas_price)
## Date Crude_price_per_barrel Stocks_per_thousand_barrels
## Min. :1994-11-25 Min. : 3.32 Min. : 16585
## 1st Qu.:2001-09-22 1st Qu.: 27.87 1st Qu.: 27932
## Median :2008-08-04 Median : 50.77 Median :101114
## Mean :2008-07-28 Mean : 53.45 Mean : 94755
## 3rd Qu.:2015-05-27 3rd Qu.: 72.92 3rd Qu.:153529
## Max. :2022-03-25 Max. :139.95 Max. :180559
## No_of_days Utilization_percentage Exp_thousand_barrels_per_day
## Min. :19.80 Min. : 56.00 Min. : 10.0
## 1st Qu.:23.10 1st Qu.: 87.10 1st Qu.: 30.0
## Median :24.20 Median : 90.35 Median : 95.0
## Mean :24.62 Mean : 89.69 Mean : 561.2
## 3rd Qu.:25.60 3rd Qu.: 93.30 3rd Qu.: 420.0
## Max. :48.70 Max. :100.50 Max. :4462.0
## Imp_thousand_barrels_per_day Gas_price_date Gas_price_per_gallon
## Min. : 4599 Min. :1994-11-28 Min. :0.926
## 1st Qu.: 7514 1st Qu.:2001-09-25 1st Qu.:1.452
## Median : 8388 Median :2008-08-07 Median :2.292
## Mean : 8404 Mean :2008-07-31 Mean :2.270
## 3rd Qu.: 9414 3rd Qu.:2015-05-30 3rd Qu.:2.858
## Max. :11324 Max. :2022-03-28 Max. :4.252
Each row of this dataset represents weekly historical information on variables that affect gas prices, from 1994 to 2022. These variables include the crude price per barrel, stocks per thousand barrels, gas price per gallon, and more. There are not any NA values in this dataset, which is also demonstrated below.
any(is.na(gas_price))
## [1] FALSE
Equivalently,
all(!is.na(gas_price))
## [1] TRUE
Here we show summary statistics for gas prices throughout this dataset.
## Summary Statistics
cat(sprintf("Mean and Median for gas prices are: %.2f and %.2f \n", mean(gas_price$Gas_price_per_gallon), median(gas_price$Gas_price_per_gallon)))
## Mean and Median for gas prices are: 2.27 and 2.29
cat(sprintf("Standard Deviation for gas price is: %.2f \n", sd(gas_price$Gas_price_per_gallon)))
## Standard Deviation for gas price is: 0.85
cat(sprintf("Min, Max and Range for gas prices are: %.2f, %.2f and %.2f \n", min(gas_price$Gas_price_per_gallon), max(gas_price$Gas_price_per_gallon), max(gas_price$Gas_price_per_gallon)-min(gas_price$Gas_price_per_gallon)))
## Min, Max and Range for gas prices are: 0.93, 4.25 and 3.33
cat(sprintf("95th and 99th percentile for gas prices are: %.2f and %.2f \n", quantile(gas_price$Gas_price_per_gallon, 0.95), quantile(gas_price$Gas_price_per_gallon, 0.99)))
## 95th and 99th percentile for gas prices are: 3.70 and 3.96
cat(sprintf("IQR for gas price is: %.2f \n", IQR(gas_price$Gas_price_per_gallon)))
## IQR for gas price is: 1.41
cat(sprintf("Coefficient of variation for gas price is: %.2f \n", sd(gas_price$Gas_price_per_gallon)/mean(gas_price$Gas_price_per_gallon)))
## Coefficient of variation for gas price is: 0.37
gas_price %>%
arrange(desc(Gas_price_per_gallon)) %>%
head(n=10)
## Date Crude_price_per_barrel Stocks_per_thousand_barrels No_of_days
## 1 2022-03-11 113.39 17123 27.3
## 2 2022-03-18 100.43 18344 27.0
## 3 2022-03-25 113.69 18971 27.3
## 4 2008-07-11 139.95 107192 22.9
## 5 2008-06-27 137.00 106707 22.6
## 6 2008-06-13 134.80 106241 22.5
## 7 2008-07-18 135.37 108249 23.2
## 8 2008-06-20 134.34 106292 22.5
## 9 2008-05-30 128.47 105616 22.5
## 10 2008-05-23 130.14 103839 22.1
## Utilization_percentage Exp_thousand_barrels_per_day
## 1 90.4 2936
## 2 91.1 3844
## 3 92.1 2988
## 4 89.5 27
## 5 89.2 27
## 6 89.3 27
## 7 87.1 27
## 8 88.6 27
## 9 89.7 27
## 10 87.9 26
## Imp_thousand_barrels_per_day Gas_price_date Gas_price_per_gallon
## 1 6395 2022-03-14 4.252
## 2 6486 2022-03-21 4.165
## 3 6259 2022-03-28 4.152
## 4 10791 2008-07-14 4.102
## 5 10168 2008-06-30 4.075
## 6 10259 2008-06-16 4.056
## 7 9806 2008-07-21 4.054
## 8 10251 2008-06-23 4.051
## 9 9786 2008-06-02 3.980
## 10 8959 2008-05-26 3.960
From the above table, it is evident that the gas prices were high during March 2022 and throughout 2008. Looking at some external factors that occurred during this period of time, it makes sense why these would consist of the highest gas prices. In 2008, the energy crisis was coming to an end, and the crude price per barrel hit its peak. Factors such as a surge in demand and international tension led gas prices to be higher. In 2022, certain factors such as decreased oil production due to COVID, sanctions on Russian oil, and post-pandemic demand increases, gas prices have been steadily rising throughout March and much of April. When evaluating gas prices, it’s important to analyze various factors such as these, in addition to factors such as the crude price per barrel, stocks, etc.
ts_plot(gas_price)
## Warning in ts_plot(gas_price): There are multipe 'date' objects in the data
## frame,using the first one object as the plot index
We also created an interactive time series plot which can be used to select between all of the variables in the data set to view the trends. Crude price per barrel, for example, had a similar trend as the gas prices did, which makes sense given that they are highly correlated. There was a sharp peak and decrease in 2008, followed by a short increases and decreases before increasing steadily in 2021. You can also see that the stocks variable had a decline throughout the time period in which this data was obtained.
ts_gas_price <- gas_price %>%
select(Date, Gas_price_per_gallon)
ts_plot(ts_gas_price)
In this next section, we created graphs to show the distribution of data for the various variables. We created histograms and density plots, as well as boxplots and violin plots.
ggplot(gas_price) + geom_histogram(aes(x=Gas_price_per_gallon), fill="#FF9999") +
ggtitle("Distribution of gas prices", subtitle = "U.S. Weekly Retail Gas Prices (1994-2022)") +
xlab("Selling price") +
ylab("Number of gallons")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(gas_price) + geom_density(aes(x=Gas_price_per_gallon))
For the price of gas per gallon, most of the data is below $3.00 per gallon, with a max of $4.25 per gallon.
ggplot(gas_price) + geom_histogram(aes(x=Crude_price_per_barrel), fill="#FF9999") +
ggtitle("Distribution of crude oil prices", subtitle = "U.S. daily Crude oil Prices (1994-2022)") +
xlab("Selling price") +
ylab("Number of barrels")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(gas_price) + geom_density(aes(x=Crude_price_per_barrel))
For crude price per barrel, most of these were in the lower range as well, with the majority of these at less than a $100 per barrel price point. The peak was at $20 per barrel, and the maximum selling price was approximately $145 per barrel.
ggplot(gas_price) + geom_histogram(aes(x=Stocks_per_thousand_barrels), fill="#FF9999") +
ggtitle("Distribution of stocks of gasoline", subtitle = "U.S. weekly stocks of gasoline (1994-2022)") +
xlab("Stocks value") +
ylab("Number of thousand barrels")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(gas_price) + geom_density(aes(x=Stocks_per_thousand_barrels))
The distribution of stocks per thousand barrels is bimodal, with extremes on both ends of the range and a dip in the middle. There is a wide range for this data, from approximately $20,000 to $80,000 per thousand barrels.
ggplot(data = gas_price) +
geom_boxplot(aes(y=Gas_price_per_gallon, x=Stocks_per_thousand_barrels))
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
ggplot(data = gas_price) +
geom_boxplot(aes(y=Crude_price_per_barrel, x=Stocks_per_thousand_barrels))
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
ggplot(data = gas_price) +
geom_boxplot(aes(y=Crude_price_per_barrel, x=Gas_price_per_gallon))
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
ggplot(data = gas_price) +
geom_violin(aes(y=Crude_price_per_barrel, x=Gas_price_per_gallon))
skim(gas_price)
| Name | gas_price |
| Number of rows | 1006 |
| Number of columns | 9 |
| _______________________ | |
| Column type frequency: | |
| Date | 2 |
| numeric | 7 |
| ________________________ | |
| Group variables | None |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| Date | 0 | 1 | 1994-11-25 | 2022-03-25 | 2008-08-04 | 1006 |
| Gas_price_date | 0 | 1 | 1994-11-28 | 2022-03-28 | 2008-08-07 | 1006 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Crude_price_per_barrel | 0 | 1 | 53.45 | 28.54 | 3.32 | 27.87 | 50.78 | 72.93 | 139.95 | ▇▇▆▅▁ |
| Stocks_per_thousand_barrels | 0 | 1 | 94755.21 | 57304.59 | 16585.00 | 27931.50 | 101114.00 | 153529.00 | 180559.00 | ▇▃▃▃▇ |
| No_of_days | 0 | 1 | 24.62 | 2.54 | 19.80 | 23.10 | 24.20 | 25.60 | 48.70 | ▇▂▁▁▁ |
| Utilization_percentage | 0 | 1 | 89.69 | 5.13 | 56.00 | 87.10 | 90.35 | 93.30 | 100.50 | ▁▁▁▇▆ |
| Exp_thousand_barrels_per_day | 0 | 1 | 561.20 | 1034.38 | 10.00 | 30.00 | 95.00 | 420.00 | 4462.00 | ▇▁▁▁▁ |
| Imp_thousand_barrels_per_day | 0 | 1 | 8403.97 | 1318.47 | 4599.00 | 7514.25 | 8387.50 | 9414.50 | 11324.00 | ▁▃▇▇▃ |
| Gas_price_per_gallon | 0 | 1 | 2.27 | 0.85 | 0.93 | 1.45 | 2.29 | 2.86 | 4.25 | ▇▅▇▃▂ |
The above Skim function shows us the summary of our dataset in terms of the type of columns in our dataframe.
ggplot(gas_price, aes(x = Exp_thousand_barrels_per_day)) +
ggtitle("Distribution of exported barrels", subtitle = "U.S. Weekly Exports of Crude Oil (1994-2022)") +
xlab("Amount exported") +
ylab("Density") +
geom_histogram(aes(y = ..density..),
colour = 1, fill = "#FF9999") +
geom_density()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
For the distribution of exported barrels, there is a right-skewed distribution as seen above, showing that there was typically a low amount of exported barrels.
ggplot(gas_price, aes(x = Imp_thousand_barrels_per_day)) +
ggtitle("Distribution of imported barrels", subtitle = "U.S. Weekly Imports of Crude Oil (1994-2022)") +
xlab("Amount imported") +
ylab("Density") +
geom_histogram(aes(y = ..density..),
colour = 1, fill = "#FF9999") +
geom_density()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The amount imported has a slight left-skew, but this is less extreme than the skew on the amount imported.
ggplot(gas_price) + geom_density(aes(x=Utilization_percentage), fill="#FF9999") +
ggtitle("Percent Utilization of Refinery Operable Capacity", subtitle = "U.S. Weekly Percent Utilization (1994-2022)") +
xlab("Percent utilization") +
ylab("Density")
The percent utilization has a left skew, showing that there was typically a high percent utilization of the refinery operable capacity.
ggplot(gas_price) + geom_density(aes(x=No_of_days), fill="#FF9999") +
ggtitle("Days of supply of total gasoline", subtitle = "U.S. Weekly Days of Total Supply (1994-2022") +
xlab("Number of days") +
ylab("Density")
The days of supply variable has a right skew, and is centered around the 25 day mark. This measures the adequacy of inventories, and remains relatively consistent throughtout the data, primarily between the 20 and 30 day marks.
cols_for_corr <- c(2:7, 9)
gasprice_cor <- cor(gas_price[, cols_for_corr])
gasprice_cor
## Crude_price_per_barrel Stocks_per_thousand_barrels
## Crude_price_per_barrel 1.00000000 -0.5947264
## Stocks_per_thousand_barrels -0.59472636 1.0000000
## No_of_days -0.18501640 -0.1964410
## Utilization_percentage -0.35799013 0.3475956
## Exp_thousand_barrels_per_day 0.05832063 -0.6024818
## Imp_thousand_barrels_per_day 0.06797031 0.4708912
## Gas_price_per_gallon 0.96773237 -0.7217388
## No_of_days Utilization_percentage
## Crude_price_per_barrel -0.1850164 -0.3579901
## Stocks_per_thousand_barrels -0.1964410 0.3475956
## No_of_days 1.0000000 -0.3143965
## Utilization_percentage -0.3143965 1.0000000
## Exp_thousand_barrels_per_day 0.4088613 -0.2065698
## Imp_thousand_barrels_per_day -0.5678313 0.2243003
## Gas_price_per_gallon -0.1388806 -0.3598479
## Exp_thousand_barrels_per_day
## Crude_price_per_barrel 0.05832063
## Stocks_per_thousand_barrels -0.60248177
## No_of_days 0.40886134
## Utilization_percentage -0.20656981
## Exp_thousand_barrels_per_day 1.00000000
## Imp_thousand_barrels_per_day -0.64120019
## Gas_price_per_gallon 0.19010678
## Imp_thousand_barrels_per_day Gas_price_per_gallon
## Crude_price_per_barrel 0.06797031 0.96773237
## Stocks_per_thousand_barrels 0.47089115 -0.72173878
## No_of_days -0.56783132 -0.13888060
## Utilization_percentage 0.22430029 -0.35984791
## Exp_thousand_barrels_per_day -0.64120019 0.19010678
## Imp_thousand_barrels_per_day 1.00000000 -0.03379215
## Gas_price_per_gallon -0.03379215 1.00000000
corrplot::corrplot(gasprice_cor)
The above correlation plot exhibits the relationship between the variables of our dataset. We can see that Crude Price and Stocks have a stronger correlation with Gas Price, the Crude Price has a strong Positive correlation whereas Stocks has a strong Negative correlation with Gas Price.
# Simple partition into train (70%) and test (30%) set
set.seed(447)
trainIndex <- createDataPartition(gas_price$Gas_price_per_gallon, p = .7,
list = FALSE,
times = 1)
gas_price_train <- gas_price[as.vector(trainIndex), ]
gas_price_test <- gas_price[-as.vector(trainIndex), ]
Scatter plots
ggplot(gas_price_train) + geom_point(aes(x=Crude_price_per_barrel, y=Gas_price_per_gallon))
ggplot(gas_price_train) + geom_point(aes(x=Stocks_per_thousand_barrels, y=Gas_price_per_gallon))
mean(gas_price_train$Gas_price_per_gallon)
## [1] 2.273307
gasprice_lm0 <- lm(Gas_price_per_gallon ~ 1, data = gas_price_train)
summary(gasprice_lm0)
##
## Call:
## lm(formula = Gas_price_per_gallon ~ 1, data = gas_price_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.34731 -0.82206 0.01869 0.58519 1.97869
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.27331 0.03224 70.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8565 on 705 degrees of freedom
# Compute overall mean Gas Price
null_pred <- mean(gas_price_train$Gas_price_per_gallon)
sprintf("Null model prediction: %.2f", null_pred)
## [1] "Null model prediction: 2.27"
# Compute null model MAE on train
null_train_mae <- MAE(gas_price_train$Gas_price_per_gallon, null_pred)
sprintf("Null model train MAE: %.2f", null_train_mae)
## [1] "Null model train MAE: 0.73"
# Compute null model MAE on test
null_test_mae <- MAE(gas_price_test$Gas_price_per_gallon, null_pred)
sprintf("Null model test MAE: %.2f", null_test_mae)
## [1] "Null model test MAE: 0.72"
gasprice_lm1 <- lm(Gas_price_per_gallon ~ Crude_price_per_barrel + Stocks_per_thousand_barrels + No_of_days + Utilization_percentage + Exp_thousand_barrels_per_day + Imp_thousand_barrels_per_day, data = gas_price_train)
summary(gasprice_lm1)
##
## Call:
## lm(formula = Gas_price_per_gallon ~ Crude_price_per_barrel +
## Stocks_per_thousand_barrels + No_of_days + Utilization_percentage +
## Exp_thousand_barrels_per_day + Imp_thousand_barrels_per_day,
## data = gas_price_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.41574 -0.08708 -0.01783 0.05478 0.49003
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.041e+00 1.714e-01 6.076 2.02e-09 ***
## Crude_price_per_barrel 2.478e-02 2.928e-04 84.610 < 2e-16 ***
## Stocks_per_thousand_barrels -3.387e-06 1.723e-07 -19.664 < 2e-16 ***
## No_of_days -7.583e-03 2.844e-03 -2.666 0.00785 **
## Utilization_percentage 3.616e-03 1.241e-03 2.913 0.00370 **
## Exp_thousand_barrels_per_day 2.216e-05 7.934e-06 2.794 0.00536 **
## Imp_thousand_barrels_per_day 8.294e-06 6.115e-06 1.356 0.17545
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1398 on 699 degrees of freedom
## Multiple R-squared: 0.9736, Adjusted R-squared: 0.9734
## F-statistic: 4293 on 6 and 699 DF, p-value: < 2.2e-16
vif(gasprice_lm1)
## Crude_price_per_barrel Stocks_per_thousand_barrels
## 2.578147 3.497961
## No_of_days Utilization_percentage
## 1.732759 1.437350
## Exp_thousand_barrels_per_day Imp_thousand_barrels_per_day
## 2.309302 2.386561
as we can see in the above Variance Inflation Factor the values for the independent variables were less than 5 which indicates that there is no issue of multicollinearity with this model.
mae_train <- c(MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm0$fitted.values),
MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm1$fitted.values)
)
mae_train
## [1] 0.7285952 0.1031485
mae_train_lm1 <- MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm1$fitted.values)
mae_train_lm1
## [1] 0.1031485
predict_lm1 <- predict(gasprice_lm1, newdata = gas_price_test)
mae_test <- c(MAE(gas_price_test$Gas_price_per_gallon, null_pred),
MAE(gas_price_test$Gas_price_per_gallon, predict_lm1)
)
mae_test
## [1] 0.7208292 0.1081121
The gasprice_lm1 fits pretty well as it has r-square value of 0.9736 which is closer to 1 and almost all of them are significant variables except for import data variable. Whereas the MAE for gasprice_lm1 is lower than MAE for gasprice_lm0 since gasprice_lm1 contains additional variables.
gasprice_lm2 <- lm(Gas_price_per_gallon ~ Stocks_per_thousand_barrels + No_of_days + Utilization_percentage + Exp_thousand_barrels_per_day + Imp_thousand_barrels_per_day, data = gas_price_train)
summary(gasprice_lm2)
##
## Call:
## lm(formula = Gas_price_per_gallon ~ Stocks_per_thousand_barrels +
## No_of_days + Utilization_percentage + Exp_thousand_barrels_per_day +
## Imp_thousand_barrels_per_day, data = gas_price_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.25836 -0.27712 -0.03198 0.29967 1.82858
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.576e+00 5.307e-01 12.391 < 2e-16 ***
## Stocks_per_thousand_barrels -1.349e-05 4.161e-07 -32.410 < 2e-16 ***
## No_of_days -5.415e-02 9.349e-03 -5.792 1.05e-08 ***
## Utilization_percentage -3.077e-02 3.930e-03 -7.829 1.83e-14 ***
## Exp_thousand_barrels_per_day -1.622e-04 2.556e-05 -6.346 3.98e-10 ***
## Imp_thousand_barrels_per_day 1.382e-04 1.983e-05 6.971 7.30e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4684 on 700 degrees of freedom
## Multiple R-squared: 0.703, Adjusted R-squared: 0.7009
## F-statistic: 331.4 on 5 and 700 DF, p-value: < 2.2e-16
mae_train_lm2 <- MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm2$fitted.values)
mae_train_lm2
## [1] 0.3571621
gasprice_lm3 <- lm(Gas_price_per_gallon ~ Stocks_per_thousand_barrels + No_of_days + Utilization_percentage + Exp_thousand_barrels_per_day, data = gas_price_train)
summary(gasprice_lm3)
##
## Call:
## lm(formula = Gas_price_per_gallon ~ Stocks_per_thousand_barrels +
## No_of_days + Utilization_percentage + Exp_thousand_barrels_per_day,
## data = gas_price_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.15973 -0.30274 -0.02469 0.30241 1.91508
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.615e+00 4.575e-01 18.832 < 2e-16 ***
## Stocks_per_thousand_barrels -1.289e-05 4.208e-07 -30.630 < 2e-16 ***
## No_of_days -8.409e-02 8.581e-03 -9.799 < 2e-16 ***
## Utilization_percentage -3.252e-02 4.053e-03 -8.024 4.32e-15 ***
## Exp_thousand_barrels_per_day -2.331e-04 2.423e-05 -9.618 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4841 on 701 degrees of freedom
## Multiple R-squared: 0.6824, Adjusted R-squared: 0.6806
## F-statistic: 376.5 on 4 and 701 DF, p-value: < 2.2e-16
mae_train_lm3 <- MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm3$fitted.values)
mae_train_lm3
## [1] 0.3745696
gasprice_lm4 <- lm(Gas_price_per_gallon ~ Stocks_per_thousand_barrels + No_of_days + Utilization_percentage, data = gas_price_train)
summary(gasprice_lm4)
##
## Call:
## lm(formula = Gas_price_per_gallon ~ Stocks_per_thousand_barrels +
## No_of_days + Utilization_percentage, data = gas_price_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.29993 -0.33327 -0.07542 0.29519 1.65483
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.438e+00 4.778e-01 19.752 <2e-16 ***
## Stocks_per_thousand_barrels -1.057e-05 3.668e-07 -28.824 <2e-16 ***
## No_of_days -1.159e-01 8.417e-03 -13.772 <2e-16 ***
## Utilization_percentage -3.680e-02 4.283e-03 -8.591 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5147 on 702 degrees of freedom
## Multiple R-squared: 0.6405, Adjusted R-squared: 0.6389
## F-statistic: 416.9 on 3 and 702 DF, p-value: < 2.2e-16
mae_train_lm4 <- MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm4$fitted.values)
mae_train_lm4
## [1] 0.4032406
gasprice_lm5 <- lm(Gas_price_per_gallon ~ Stocks_per_thousand_barrels + No_of_days, data = gas_price_train)
summary(gasprice_lm5)
##
## Call:
## lm(formula = Gas_price_per_gallon ~ Stocks_per_thousand_barrels +
## No_of_days, data = gas_price_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.10842 -0.38376 -0.07734 0.32972 1.79376
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.744e+00 2.190e-01 26.23 <2e-16 ***
## Stocks_per_thousand_barrels -1.165e-05 3.621e-07 -32.17 <2e-16 ***
## No_of_days -9.582e-02 8.494e-03 -11.28 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5407 on 703 degrees of freedom
## Multiple R-squared: 0.6027, Adjusted R-squared: 0.6015
## F-statistic: 533.2 on 2 and 703 DF, p-value: < 2.2e-16
mae_train_lm5 <- MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm5$fitted.values)
mae_train_lm5
## [1] 0.429309
gasprice_lm6 <- lm(Gas_price_per_gallon ~ Stocks_per_thousand_barrels, data = gas_price_train)
summary(gasprice_lm6)
##
## Call:
## lm(formula = Gas_price_per_gallon ~ Stocks_per_thousand_barrels,
## data = gas_price_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2823 -0.4074 -0.1600 0.4253 1.9211
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.314e+00 4.300e-02 77.07 <2e-16 ***
## Stocks_per_thousand_barrels -1.091e-05 3.868e-07 -28.22 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5871 on 704 degrees of freedom
## Multiple R-squared: 0.5307, Adjusted R-squared: 0.5301
## F-statistic: 796.3 on 1 and 704 DF, p-value: < 2.2e-16
mae_train_lm6 <- MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm6$fitted.values)
mae_train_lm6
## [1] 0.4858075
gasprice_lm7 <- lm(Gas_price_per_gallon ~ Stocks_per_thousand_barrels + No_of_days + Utilization_percentage + Exp_thousand_barrels_per_day + Crude_price_per_barrel, data = gas_price_train)
summary(gasprice_lm7)
##
## Call:
## lm(formula = Gas_price_per_gallon ~ Stocks_per_thousand_barrels +
## No_of_days + Utilization_percentage + Exp_thousand_barrels_per_day +
## Crude_price_per_barrel, data = gas_price_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.41145 -0.08643 -0.01746 0.05633 0.49323
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.134e+00 1.573e-01 7.205 1.51e-12 ***
## Stocks_per_thousand_barrels -3.313e-06 1.634e-07 -20.270 < 2e-16 ***
## No_of_days -9.078e-03 2.623e-03 -3.460 0.000572 ***
## Utilization_percentage 3.656e-03 1.242e-03 2.944 0.003348 **
## Exp_thousand_barrels_per_day 1.892e-05 7.570e-06 2.500 0.012659 *
## Crude_price_per_barrel 2.488e-02 2.836e-04 87.711 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1399 on 700 degrees of freedom
## Multiple R-squared: 0.9735, Adjusted R-squared: 0.9733
## F-statistic: 5145 on 5 and 700 DF, p-value: < 2.2e-16
mae_train_lm7 <- MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm7$fitted.values)
mae_train_lm7
## [1] 0.1034463
gasprice_lm8 <- lm(Gas_price_per_gallon ~ Crude_price_per_barrel + Stocks_per_thousand_barrels + No_of_days + Utilization_percentage, data = gas_price_train)
summary(gasprice_lm8)
##
## Call:
## lm(formula = Gas_price_per_gallon ~ Crude_price_per_barrel +
## Stocks_per_thousand_barrels + No_of_days + Utilization_percentage,
## data = gas_price_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.41894 -0.08305 -0.01882 0.06074 0.48492
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.157e+00 1.576e-01 7.342 5.87e-13 ***
## Crude_price_per_barrel 2.461e-02 2.634e-04 93.431 < 2e-16 ***
## Stocks_per_thousand_barrels -3.578e-06 1.250e-07 -28.627 < 2e-16 ***
## No_of_days -7.677e-03 2.572e-03 -2.985 0.00294 **
## Utilization_percentage 3.561e-03 1.246e-03 2.859 0.00438 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1404 on 701 degrees of freedom
## Multiple R-squared: 0.9733, Adjusted R-squared: 0.9731
## F-statistic: 6382 on 4 and 701 DF, p-value: < 2.2e-16
mae_train_lm8 <- MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm8$fitted.values)
mae_train_lm8
## [1] 0.1046638
gasprice_lm9 <- lm(Gas_price_per_gallon ~ Crude_price_per_barrel + Stocks_per_thousand_barrels + No_of_days, data = gas_price_train)
summary(gasprice_lm9)
##
## Call:
## lm(formula = Gas_price_per_gallon ~ Crude_price_per_barrel +
## Stocks_per_thousand_barrels + No_of_days, data = gas_price_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.44504 -0.08533 -0.01868 0.06310 0.51012
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.560e+00 7.135e-02 21.861 < 2e-16 ***
## Crude_price_per_barrel 2.435e-02 2.483e-04 98.053 < 2e-16 ***
## Stocks_per_thousand_barrels -3.560e-06 1.255e-07 -28.376 < 2e-16 ***
## No_of_days -1.054e-02 2.382e-03 -4.424 1.12e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1411 on 702 degrees of freedom
## Multiple R-squared: 0.973, Adjusted R-squared: 0.9728
## F-statistic: 8421 on 3 and 702 DF, p-value: < 2.2e-16
mae_train_lm9 <- MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm9$fitted.values)
mae_train_lm9
## [1] 0.1053172
gasprice_lm10 <- lm(Gas_price_per_gallon ~ Crude_price_per_barrel + Stocks_per_thousand_barrels, data = gas_price_train)
summary(gasprice_lm10)
##
## Call:
## lm(formula = Gas_price_per_gallon ~ Crude_price_per_barrel +
## Stocks_per_thousand_barrels, data = gas_price_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.41985 -0.08728 -0.01990 0.05855 0.49917
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.259e+00 2.209e-02 57.01 <2e-16 ***
## Crude_price_per_barrel 2.475e-02 2.342e-04 105.68 <2e-16 ***
## Stocks_per_thousand_barrels -3.357e-06 1.183e-07 -28.38 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.143 on 703 degrees of freedom
## Multiple R-squared: 0.9722, Adjusted R-squared: 0.9721
## F-statistic: 1.23e+04 on 2 and 703 DF, p-value: < 2.2e-16
mae_train_lm10 <- MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm10$fitted.values)
mae_train_lm10
## [1] 0.1052105
gasprice_lm11 <- lm(Gas_price_per_gallon ~ Crude_price_per_barrel, data = gas_price_train)
summary(gasprice_lm11)
##
## Call:
## lm(formula = Gas_price_per_gallon ~ Crude_price_per_barrel, data = gas_price_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.5648 -0.1388 -0.0510 0.1604 0.7260
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.722287 0.016696 43.26 <2e-16 ***
## Crude_price_per_barrel 0.028765 0.000273 105.36 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2093 on 704 degrees of freedom
## Multiple R-squared: 0.9404, Adjusted R-squared: 0.9403
## F-statistic: 1.11e+04 on 1 and 704 DF, p-value: < 2.2e-16
mae_train_lm11 <- MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm11$fitted.values)
mae_train_lm11
## [1] 0.1704284
gasprice_lm12 <- lm(Gas_price_per_gallon ~ No_of_days + Utilization_percentage + Exp_thousand_barrels_per_day + Imp_thousand_barrels_per_day + Crude_price_per_barrel, data = gas_price_train)
summary(gasprice_lm12)
##
## Call:
## lm(formula = Gas_price_per_gallon ~ No_of_days + Utilization_percentage +
## Exp_thousand_barrels_per_day + Imp_thousand_barrels_per_day +
## Crude_price_per_barrel, data = gas_price_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.47608 -0.10762 -0.03560 0.09335 0.66852
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.552e-01 2.131e-01 4.013 6.63e-05 ***
## No_of_days -9.652e-03 3.540e-03 -2.727 0.00655 **
## Utilization_percentage 3.388e-03 1.546e-03 2.191 0.02875 *
## Exp_thousand_barrels_per_day 9.864e-05 8.612e-06 11.453 < 2e-16 ***
## Imp_thousand_barrels_per_day -2.986e-05 7.222e-06 -4.135 3.98e-05 ***
## Crude_price_per_barrel 2.877e-02 2.629e-04 109.403 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1741 on 700 degrees of freedom
## Multiple R-squared: 0.959, Adjusted R-squared: 0.9587
## F-statistic: 3272 on 5 and 700 DF, p-value: < 2.2e-16
mae_train_lm12 <- MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm12$fitted.values)
mae_train_lm12
## [1] 0.1335655
gasprice_lm13 <- lm(Gas_price_per_gallon ~ No_of_days + Utilization_percentage + Exp_thousand_barrels_per_day + Imp_thousand_barrels_per_day, data = gas_price_train)
summary(gasprice_lm13)
##
## Call:
## lm(formula = Gas_price_per_gallon ~ No_of_days + Utilization_percentage +
## Exp_thousand_barrels_per_day + Imp_thousand_barrels_per_day,
## data = gas_price_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7298 -0.5789 -0.1056 0.5001 1.9294
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.198e+01 7.961e-01 15.044 < 2e-16 ***
## No_of_days -1.274e-01 1.433e-02 -8.891 < 2e-16 ***
## Utilization_percentage -7.492e-02 5.825e-03 -12.862 < 2e-16 ***
## Exp_thousand_barrels_per_day 1.961e-04 3.642e-05 5.386 9.85e-08 ***
## Imp_thousand_barrels_per_day 6.274e-06 3.067e-05 0.205 0.838
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7402 on 701 degrees of freedom
## Multiple R-squared: 0.2573, Adjusted R-squared: 0.2531
## F-statistic: 60.73 on 4 and 701 DF, p-value: < 2.2e-16
mae_train_lm13 <- MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm13$fitted.values)
mae_train_lm13
## [1] 0.6063415
gasprice_lm14 <- lm(Gas_price_per_gallon ~ No_of_days + Utilization_percentage + Exp_thousand_barrels_per_day, data = gas_price_train)
summary(gasprice_lm14)
##
## Call:
## lm(formula = Gas_price_per_gallon ~ No_of_days + Utilization_percentage +
## Exp_thousand_barrels_per_day, data = gas_price_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7276 -0.5793 -0.1012 0.4996 1.9293
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.206e+01 6.776e-01 17.801 < 2e-16 ***
## No_of_days -1.287e-01 1.292e-02 -9.960 < 2e-16 ***
## Utilization_percentage -7.491e-02 5.821e-03 -12.869 < 2e-16 ***
## Exp_thousand_barrels_per_day 1.920e-04 3.036e-05 6.326 4.49e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7397 on 702 degrees of freedom
## Multiple R-squared: 0.2573, Adjusted R-squared: 0.2541
## F-statistic: 81.07 on 3 and 702 DF, p-value: < 2.2e-16
mae_train_lm14 <- MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm14$fitted.values)
mae_train_lm14
## [1] 0.6064749
gasprice_lm15 <- lm(Gas_price_per_gallon ~ No_of_days + Utilization_percentage, data = gas_price_train)
summary(gasprice_lm15)
##
## Call:
## lm(formula = Gas_price_per_gallon ~ No_of_days + Utilization_percentage,
## data = gas_price_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8058 -0.6348 -0.1168 0.5862 2.3328
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.813596 0.694978 16.999 < 2e-16 ***
## No_of_days -0.099518 0.012401 -8.025 4.26e-15 ***
## Utilization_percentage -0.079018 0.005943 -13.296 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.76 on 703 degrees of freedom
## Multiple R-squared: 0.215, Adjusted R-squared: 0.2127
## F-statistic: 96.25 on 2 and 703 DF, p-value: < 2.2e-16
mae_train_lm15 <- MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm15$fitted.values)
mae_train_lm15
## [1] 0.6472869
gasprice_lm16 <- lm(Gas_price_per_gallon ~ No_of_days, data = gas_price_train)
summary(gasprice_lm16)
##
## Call:
## lm(formula = Gas_price_per_gallon ~ No_of_days, data = gas_price_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.28700 -0.88130 0.04211 0.56481 2.10313
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.42011 0.32490 10.527 < 2e-16 ***
## No_of_days -0.04657 0.01313 -3.547 0.000416 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8496 on 704 degrees of freedom
## Multiple R-squared: 0.01756, Adjusted R-squared: 0.01616
## F-statistic: 12.58 on 1 and 704 DF, p-value: 0.0004157
mae_train_lm16 <- MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm16$fitted.values)
mae_train_lm16
## [1] 0.725579
gasprice_lm17 <- lm(Gas_price_per_gallon ~ Utilization_percentage + Crude_price_per_barrel + Stocks_per_thousand_barrels + No_of_days + Imp_thousand_barrels_per_day, data = gas_price_train)
summary(gasprice_lm17)
##
## Call:
## lm(formula = Gas_price_per_gallon ~ Utilization_percentage +
## Crude_price_per_barrel + Stocks_per_thousand_barrels + No_of_days +
## Imp_thousand_barrels_per_day, data = gas_price_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42106 -0.08349 -0.01973 0.05977 0.48856
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.124e+00 1.696e-01 6.626 6.88e-11 ***
## Utilization_percentage 3.540e-03 1.247e-03 2.839 0.00466 **
## Crude_price_per_barrel 2.455e-02 2.829e-04 86.775 < 2e-16 ***
## Stocks_per_thousand_barrels -3.623e-06 1.509e-07 -24.015 < 2e-16 ***
## No_of_days -7.019e-03 2.851e-03 -2.462 0.01405 *
## Imp_thousand_barrels_per_day 3.146e-06 5.859e-06 0.537 0.59145
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1405 on 700 degrees of freedom
## Multiple R-squared: 0.9733, Adjusted R-squared: 0.9731
## F-statistic: 5101 on 5 and 700 DF, p-value: < 2.2e-16
mae_train_lm17 <- MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm17$fitted.values)
mae_train_lm17
## [1] 0.1046834
gasprice_lm18 <- lm(Gas_price_per_gallon ~ Utilization_percentage + Crude_price_per_barrel, data = gas_price_train)
summary(gasprice_lm18)
##
## Call:
## lm(formula = Gas_price_per_gallon ~ Utilization_percentage +
## Crude_price_per_barrel, data = gas_price_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.56836 -0.14031 -0.05015 0.15903 0.73363
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.8741178 0.1574309 5.552 4e-08 ***
## Utilization_percentage -0.0016266 0.0016771 -0.970 0.332
## Crude_price_per_barrel 0.0286561 0.0002954 97.016 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2093 on 703 degrees of freedom
## Multiple R-squared: 0.9404, Adjusted R-squared: 0.9403
## F-statistic: 5550 on 2 and 703 DF, p-value: < 2.2e-16
mae_train_lm18 <- MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm18$fitted.values)
mae_train_lm18
## [1] 0.1700321
gasprice_lm19 <- lm(Gas_price_per_gallon ~ Utilization_percentage, data = gas_price_train)
summary(gasprice_lm19)
##
## Call:
## lm(formula = Gas_price_per_gallon ~ Utilization_percentage, data = gas_price_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.91231 -0.68783 -0.05725 0.61916 2.03036
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.988488 0.528051 15.13 <2e-16 ***
## Utilization_percentage -0.063701 0.005876 -10.84 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7934 on 704 degrees of freedom
## Multiple R-squared: 0.143, Adjusted R-squared: 0.1418
## F-statistic: 117.5 on 1 and 704 DF, p-value: < 2.2e-16
mae_train_lm19 <- MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm19$fitted.values)
mae_train_lm19
## [1] 0.6769605
maes <- c(mae_train_lm1,
mae_train_lm2,
mae_train_lm3,
mae_train_lm4,
mae_train_lm5,
mae_train_lm6,
mae_train_lm7,
mae_train_lm8,
mae_train_lm9,
mae_train_lm10,
mae_train_lm11,
mae_train_lm12,
mae_train_lm13,
mae_train_lm14,
mae_train_lm15,
mae_train_lm16,
mae_train_lm17,
mae_train_lm18,
mae_train_lm19)
maes
## [1] 0.1031485 0.3571621 0.3745696 0.4032406 0.4293090 0.4858075 0.1034463
## [8] 0.1046638 0.1053172 0.1052105 0.1704284 0.1335655 0.6063415 0.6064749
## [15] 0.6472869 0.7255790 0.1046834 0.1700321 0.6769605
rank(maes)
## [1] 1 10 11 12 13 14 2 3 6 5 9 7 15 16 17 19 4 8 18
So, lm1 has the lowest MAE followed by lm7 and then lm8.
Let’s see how these three models do in predicting the test data values.
pred_lm1 <- predict(gasprice_lm1, newdata = gas_price_test)
MAE(pred_lm1, gas_price_test$Gas_price_per_gallon)
## [1] 0.1081121
pred_lm7 <- predict(gasprice_lm7, newdata = gas_price_test)
MAE(pred_lm7, gas_price_test$Gas_price_per_gallon)
## [1] 0.1084135
pred_lm8 <- predict(gasprice_lm8, newdata = gas_price_test)
MAE(pred_lm8, gas_price_test$Gas_price_per_gallon)
## [1] 0.1096322
For our top 3 models, we have created a scatter plot showing actual vs fitted values of Gas Price.
Here are the scatters on the training data.
scatter_train_df <- data.frame(gasprice_actual = gas_price_train[,"Gas_price_per_gallon"],
lm1 = gasprice_lm1$fitted.values,
lm7 = gasprice_lm7$fitted.values,
lm8 = gasprice_lm8$fitted.values)
ggplot(scatter_train_df) + geom_point(aes(x=lm1, y=gasprice_actual)) + geom_abline(color='red')
ggplot(scatter_train_df) + geom_point(aes(x=lm7, y=gasprice_actual)) + geom_abline(color='red')
ggplot(scatter_train_df) + geom_point(aes(x=lm8, y=gasprice_actual)) + geom_abline(color='red')
The scatters confirm the relative rankings of our top 3 models
scatter_test_df <- data.frame(gasprice_actual = gas_price_test[,"Gas_price_per_gallon"],
lm1 = pred_lm1,
lm7 = pred_lm7,
lm8 = pred_lm8)
head(scatter_test_df)
## gasprice_actual lm1 lm7 lm8
## 5 3.480 3.498988 3.497462 3.472433
## 7 3.321 3.325307 3.319293 3.304099
## 8 3.271 3.304913 3.304877 3.281417
## 10 3.216 3.103842 3.113279 3.091159
## 11 3.211 3.019522 3.022481 2.992055
## 13 3.252 3.026161 3.029309 2.985269
ggplot(scatter_test_df) + geom_point(aes(x=lm1, y=gasprice_actual)) + geom_abline(color='red')
ggplot(scatter_test_df) + geom_point(aes(x=lm7, y=gasprice_actual)) + geom_abline(color='red')
ggplot(scatter_test_df) + geom_point(aes(x=lm8, y=gasprice_actual)) + geom_abline(color='red')
The scatters on test look better than the scatters on train.
ggplot(data.frame(residuals=gasprice_lm1$residuals,
fitted=gasprice_lm1$fitted.values)) + geom_point(aes(x=fitted, y=residuals))
The plot does not have constant variance as it is more of a fanned out shape which leads the selected top model towards being “Heteroskedastic”.
For each of our top 3 models, we made predictions for Gas Price using gas_price_test.
price_predict <- data.frame(predict1=predict(gasprice_lm1, newdata=gas_price_test),
predict7=predict(gasprice_lm7, newdata=gas_price_test),
predict8=predict(gasprice_lm8, newdata=gas_price_test),
price_actual=gas_price_test$Gas_price_per_gallon)
We then computed the MAE for each of the three models’ predictions on the test data.
price_predict_MAE <- data.frame(predict1_MAE=MAE(price_predict$predict1, gas_price_test$Gas_price_per_gallon),
predict7_MAE=MAE(price_predict$predict7, gas_price_test$Gas_price_per_gallon),
predict8_MAE=MAE(price_predict$predict8, gas_price_test$Gas_price_per_gallon))
price_predict_MAE
## predict1_MAE predict7_MAE predict8_MAE
## 1 0.1081121 0.1084135 0.1096322
The top model is “gasprice_lm1” for which we considered all the independent variables. The model makes sense in terms of the variables included as we can see from the correlation matrix the variable “Crude Price” had a strong positive correlation while “Stocks” had a strong negative correlation with Gas Price and same signs were depicted in the coeeficient estimate value for these variables. The reason behind selecting all variables is that almost all of them are significant variables except for import data variable and this model has the lowest MAE value for both training (0.1031485) and test set (0.1081121). Whereas the R-squared value is the highest 0.9736 being a good model fit.
str(gas_price_test$Date)
## Date[1:300], format: "2022-02-18" "2022-01-28" "2022-01-21" "2021-12-31" "2021-12-24" ...
finalpredtestData<-data.frame(gas_price_test$Date,price_predict$predict7,gas_price_test$Gas_price_per_gallon )
colnames(finalpredtestData) <-c("Date","GasPricePredicted", "GasPriceActual")
ts_plot(finalpredtestData)
In the above time series plot showcases both the actual and predicted gas price values of our champion model we can observe that both the lines are following closely with each other across the years and proves that our model fits well. But for the year 2020 probably due to COVID as a factor which caused a drastic decrease in demand so in reality the gas prices went down and we can see the model’s predicted gas price was able to follow the trend but was just a little off by 0.70 cents.